Abstract

Many fields of study in compilers give rise to the concept of a 
join point — a place where different execution paths come together. 
While they have often been treated by representing them as func- 
tions or continuations, we believe it is time to study them in their 
own right. We show that adding them to a direct-style functional 
intermediate language allows new optimizations to be performed, 
including a functional version of loop-invariant code motion. Fi- 
nally, we report on recent work on the Glasgow Haskell Compiler 
which added join points to the Core language. 

1. Introduction 

Consider this code, in a functional language: 

    if (if e1 then e2 else e3) then e4 else e5

Many compilers will perform a commuting conversion, which
naively would produce:

    if e1 then (if e2 then e4 else e5)
          else (if e3 then e4 else e5)

Commuting conversions are tremendously important in practice
(Sec. 2), but there is a problem: the conversion duplicates e4 and
e5. A natural countermeasure is to name the offending expressions
and duplicate the names instead:

    let { j4 () = e4; j5 () = e5 }
    in if e1 then (if e2 then j4 () else j5 ())
             else (if e3 then j4 () else j5 ())

We describe j4 and j5 as join points, because they say where 
execution of the two branches of the outer if joins up again. The 
duplication is gone, but a new problem has surfaced: the compiler 
may allocate closures for locally-defined functions like j4 and j5. 
That is bad because allocation is expensive. And it is tantalizing 
because all we are doing here is encoding control flow: it is plain as 
a pikestaff that the “call” to j4 should be no more than a jump, with 
no allocation anywhere. That’s what a C compiler would do! Some 
code generators can cleverly eliminate the closures, but perhaps not 
if further transformations intervene. 
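
The three forms above can be written out in ordinary Haskell. The following is a minimal sketch: the function names and the choice of Bool and Int for e1..e5 are hypothetical stand-ins, and the code says nothing about how a compiler allocates j4 and j5. It only checks that the three forms compute the same result.

```haskell
-- Hypothetical stand-ins: e1..e3 are Booleans, e4 and e5 are Ints.
original, naive, shared :: Bool -> Bool -> Bool -> Int -> Int -> Int

-- if (if e1 then e2 else e3) then e4 else e5
original e1 e2 e3 e4 e5 = if (if e1 then e2 else e3) then e4 else e5

-- Naive commuting conversion: e4 and e5 each appear twice.
naive e1 e2 e3 e4 e5 =
  if e1 then (if e2 then e4 else e5)
        else (if e3 then e4 else e5)

-- Name the branches once and duplicate only the names.
shared e1 e2 e3 e4 e5 =
  let j4 () = e4
      j5 () = e5
  in if e1 then (if e2 then j4 () else j5 ())
           else (if e3 then j4 () else j5 ())

-- Exhaustive check over all Boolean inputs.
agree :: Bool
agree = and [ original a b c 4 5 == naive a b c 4 5
              && naive a b c 4 5 == shared a b c 4 5
            | a <- bools, b <- bools, c <- bools ]
  where bools = [False, True]
```

Evaluating `agree` covers all eight combinations of the three conditions.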

The reader of Appel’s inspirational book may be thinking
“Just use continuation-passing style (CPS)!” When expressed over
CPS terms, many classic optimizations boil down to β-reduction
(i.e., function application), or arithmetic reductions, or variants
thereof. And indeed it turns out that commuting conversions fall
out rather naturally as well. But using CPS comes at a fairly heavy
price: the intermediate language becomes more complicated, some
transformations are harder or out of reach, and (unlike direct style)
CPS commits to a particular evaluation order (Sec. 8).

Inspired by Flanagan et al., the reader may now be thinking
“OK, just use administrative normal form (ANF)!” That paper
shows that many transformations achievable in CPS are equally
accessible in direct style. ANF allows an optimizer to exploit CPS
technology without needing to implement it. The motto is: Think in
CPS; work in direct style.

But alas, a subsequent paper by Kennedy shows that there remain
transformations that are inaccessible in ANF but fall out naturally
in CPS. So the obvious question is this: could we extend
ANF in some way, to get all the goodness of direct style and the
benefits of CPS? In this paper we say “yes!”, making the following
contributions:

• We describe a modest extension to a direct-style λ-calculus
intermediate language, namely adding join points (Sec. 3). We
give the syntax, type system, and operational semantics,
together with optimising transformations.

• We describe how to infer which ordinary bindings are in fact
join points (Sec. 4). In a CPS setting this analysis is called
contification, but it looks rather different in our setting.

• We show that join points can be recursive, and that recursive
join points open up a new and entirely unexpected (to us)
optimization opportunity for fusion (Sec. 5). In particular, this
insight fully resolves a long-standing tension between two
competing approaches to fusion, namely stream fusion [6] and
unfold/destroy fusion [27].

• We give some metatheory in Sec. 6, including type soundness
and correctness of the optimizing transformations. We show the
safety of adding jumps as a control effect by establishing an
equivalence with System F.

• We demonstrate that our approach works at scale, in a
state-of-the-art optimizing compiler for Haskell, GHC (Sec. 7).
As hoped, adding join points turned out to be a very modest
change, despite GHC’s scale and complexity. Like any
optimization, it does not make every program go faster, but it has a
dramatic effect on some.

Overall, adding join points to ANF has an extremely good power- 
to-weight ratio, and we strongly recommend it to any direct-style 
compiler. Our title is somewhat tongue-in-cheek, but we now know 
of no optimizing transformation that is accessible to a CPS com- 
piler but not to a direct-style one. 

2. Motivation and key ideas 

We review compilation techniques for commuting conversions, to 
expose the challenge that we tackle in this paper. For the sake of 
concreteness we describe the way things work in GHC. However, 
we believe that the whole paper is equally applicable to a call-by- 
value language. 

Case-of-case transformation Consider these function definitions:

    isNothing :: Maybe a -> Bool
    isNothing x = case x of Nothing -> True
                            Just _  -> False

    mHead :: [a] -> Maybe a
    mHead ps = case ps of []    -> Nothing
                          (p:_) -> Just p

    null :: [a] -> Bool
    null as = isNothing (mHead as)

Here null is a simple composition of the library functions isNothing
and mHead. When the optimizer works on null, it will inline both
isNothing and mHead to yield:

    null as = case (case as of []    -> Nothing
                               (p:_) -> Just p) of
                { Nothing -> True; Just _ -> False }

Executed directly, this would be terribly inefficient; if the argument 
list is non-empty we would allocate a result Just p only to im- 
mediately decompose it. We want to move the outer case into the 
branches of the inner one, like this: 

    null as = case as of
                []  -> case Nothing of Nothing -> True
                                       Just _  -> False
                p:_ -> case Just p of Nothing -> True
                                      Just _  -> False

This is a commuting conversion, specifically the case-of-case
transformation. In this example, it now happens that both inner
case expressions scrutinize a data constructor, so they can be
simplified, yielding

    null as = case as of { [] -> True; _ -> False }

which is exactly the code we would have written for null from 
scratch. 
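
The three stages of this derivation can be checked directly in Haskell. The sketch below restates the definitions from the text and adds two hand-written variants; the names nullCaseOfCase and nullSimplified are hypothetical labels for the intermediate and final forms, not GHC output.

```haskell
import Prelude hiding (null)

isNothing :: Maybe a -> Bool
isNothing x = case x of
  Nothing -> True
  Just _  -> False

mHead :: [a] -> Maybe a
mHead ps = case ps of
  []    -> Nothing
  (p:_) -> Just p

-- As written: a composition of two library functions.
null :: [a] -> Bool
null as = isNothing (mHead as)

-- After inlining and case-of-case: the outer case is copied into each branch.
nullCaseOfCase :: [a] -> Bool
nullCaseOfCase as = case as of
  []  -> case Nothing of { Nothing -> True; Just _ -> False }
  p:_ -> case Just p  of { Nothing -> True; Just _ -> False }

-- After simplifying each case that scrutinizes a known constructor.
nullSimplified :: [a] -> Bool
nullSimplified as = case as of { [] -> True; _ -> False }
```

All three functions agree on empty and non-empty lists; only the first allocates a Just when run without optimization.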

GHC does a tremendous amount of inlining, including across 
modules or even packages, so commuting conversions like this are 
very important in practice: they are the key that unlocks a cascade 
of further optimizations. 

Join points Commuting conversions have a problem, though: they
often duplicate the outer case. In our example that was OK, but
what about

    case (case v of { p1 -> e1; p2 -> e2 }) of
      { Nothing -> BIG1; Just x -> BIG2 }

where BIG1 and BIG2 are big expressions? We do not want to
duplicate these large expressions, or we would risk bloating the size of
the compiled code, perhaps exponentially when case expressions
are deeply nested [17]. It is easy to avoid this duplication by first
introducing an auxiliary let binding:

    let { j1 () = BIG1; j2 x = BIG2 } in
    case (case v of { p1 -> e1; p2 -> e2 }) of
      { Nothing -> j1 (); Just x -> j2 x }

Now we can move the outer case expression into the arms of the
inner case, without duplicating BIG1 or BIG2, thus:

    let { j1 () = BIG1; j2 x = BIG2 } in
    case v of
      p1 -> case e1 of Nothing -> j1 ()
                       Just x  -> j2 x
      p2 -> case e2 of Nothing -> j1 ()
                       Just x  -> j2 x

Notice that j2 takes as its parameter the variable bound by the
pattern Just x, whereas j1 has no parameter.¹

¹ The dummy unit parameter is not necessary in a lazy language, but it is in
a call-by-value language.

Compiling join points efficiently We call j1 and j2 join points
because you can think of them as places where control joins up
again, but so far they are perfectly ordinary let-bound functions,
and as such they will be allocated as closures in the heap. But that’s
ridiculous: all that is happening here is control flow splitting and
joining up again. A C compiler would generate a jump to a label,
not a call to a heap-allocated function closure!

So, right before code generation, GHC performs a simple analysis
to identify bindings that can be compiled as join points. This
identifies let-bound functions that will never be captured in a
closure or thunk, and will only be tail-called with exactly the right
number of arguments. (We leave the exact criteria for Sec. 4.) These
join-point bindings do not allocate anything; instead a tail call to a
join point simply adjusts the stack and jumps to the code for the
join point.

The case-of-case transformation, including the idea of using
let bindings to avoid duplication, is very old; for example, both
are features of Steele’s Rabbit compiler for Scheme [24]. In Rabbit
the transformation is limited to booleans, but the discussion above
shows that it generalizes very naturally to arbitrary data types. In
this more general form, it has been part of GHC for decades.
Likewise, the idea of generating different (and much more efficient)
code for non-escaping let bindings is well established in many
other compilers as well as GHC.

Preserving and exploiting join points So far so good, but there
is a serious problem with recognizing join points only in the back
end of the compiler. Consider this expression:

    case (let j x = BIG in
          case v of { A -> j 1; B -> j 2; C -> True }) of
      { True -> False; False -> True }

Here j is a join point. Now suppose we do case-of-case on this
expression. Treating the binding for j as an ordinary let binding
(as GHC does today), we move the outer case past the let, and
duplicate it into the branches of the inner case, yielding

    let j x = BIG in
    case v of
      A -> case (j 1) of { True -> False; False -> True }
      B -> case (j 2) of { True -> False; False -> True }
      C -> case True of { True -> False; False -> True }

The third branch simplifies nicely, but the first two do not. There
are two distinct problems:

1. The binding for j is no longer a join point (it is not tail-called),
so the super-efficient code generation strategy does not apply,
and the compiler will allocate a closure for j at runtime. This
happens in practice: we have cases in which GHC’s optimizer
actually increases allocation because it inadvertently destroys a
join point.

2. Even worse, the two copies of the outer case now scrutinize
an uninformative call like (j 1). So the extra code bloat from
duplicating the outer case is entirely wasted. And it’s a huge
lost opportunity, as we shall see.

So it is not enough to generate efficient code for join points; we
must identify, preserve, and exploit them. In our example, if the
optimizer knew that the binding for j is a join point, it could exploit
that knowledge to transform our original expression like this:

    let j x = case BIG of True  -> False
                          False -> True
    in case v of
         A -> j 1
         B -> j 2
         C -> case True of { True -> False; False -> True }

This is much, much better than our previous attempt:

• The outer case has moved into the right-hand side of the join
point, so it now scrutinizes BIG. That’s good, because BIG
might be a data constructor or a case expression (which would
expose another case-of-case opportunity). So the outer case
now scrutinizes the actual result of the expression, rather than
an uninformative join-point call. That solves problem (2).

• The A and B branches do not mention the outer case, because it
has moved into the join point itself. So j is still tail-called and
remains an efficiently-compiled join point. That solves problem
(1).

• The outer case still scrutinizes the branches that do not finish
with a join point call, e.g. the C branch.
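
The original expression and its two rewrites can be compared as plain Haskell. In this sketch the type V, the function big (standing in for BIG), and the names before, naive, and aware are hypothetical; the code checks only that the three forms agree, not how they are compiled.

```haskell
data V = A | B | C

-- Hypothetical stand-in for the paper's BIG.
big :: Int -> Bool
big x = even x

-- Original: the outer case scrutinizes an expression that ends in calls to j.
before :: V -> Bool
before v =
  case (let j x = big x in
        case v of { A -> j 1; B -> j 2; C -> True }) of
    { True -> False; False -> True }

-- Naive case-of-case: j is no longer tail-called.
naive :: V -> Bool
naive v =
  let j x = big x in
  case v of
    A -> case j 1 of { True -> False; False -> True }
    B -> case j 2 of { True -> False; False -> True }
    C -> case True of { True -> False; False -> True }

-- Join-point-aware version: the outer case moves into j's right-hand side.
aware :: V -> Bool
aware v =
  let j x = case big x of { True -> False; False -> True }
  in case v of
       A -> j 1
       B -> j 2
       C -> case True of { True -> False; False -> True }
```

In `aware`, every call to j is a tail call again, which is the property the back end needs in order to compile j as a jump.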

The key idea Thus motivated, in the rest of this paper we explore 
the following very simple idea: 

• Distinguish certain let bindings as join-point bindings, and 
their (tail-)call sites as jumps. 

• Adjust the case-of-case transformation to take account of join- 
point bindings and jumps. 

• In all the other transformations carried out by the compiler, 
ensure that join points remain join points. 

Our key innovation is that, by recognising join points as a language
construct, we both preserve join points through subsequent
transformations, and exploit join points to make other transformations more
effective. Next, we formalize this approach; subsequent sections
develop the consequences.

3. System Fj: join points and jumps 

We now formalize the intuitions developed so far by describing
System Fj, a small intermediate language with join points. Fj is
an extension of GHC’s Core intermediate language. We omit
existentials, GADTs, and coercions [25], since they are largely
orthogonal to join points.

Syntax System Fj is a simple λ-calculus language in the style of
System F, with let expressions, data type constructors, and case
expressions; its syntax is given in Fig. 1. System Fj is an
explicitly-typed language, so all binders are typed, but in our presentation we
will often drop the type annotations.

The join-point extension is highlighted in the figure and consists 
of two new syntactic constructs: 

• A join binding that declares a join point. Each join point has a 
name, a list of type parameters, a list of value parameters, and 
a body. 

• A jump expression that invokes a join point, passing all in- 
dicated arguments as well as an additional type argument (as 
discussed below). 

Although we use curried syntax for jumps, join points are
polyadic; partial application is not allowed.

Static semantics The type system for System Fj is given in
Fig. 2, where typeof gives the type of a constructor and ctors gives
the set of constructors for a datatype.

The typing judgement carries two environments, $\Gamma$ and $\Delta$, with
$\Delta$ binding join points. The environment $\Delta$ is extended by a join
(rules JBIND and RJBIND) and consulted at a jump. Note that we
rely on scoping conventions in some places: if
$\Gamma; \Delta \vdash e : \tau$, then every variable (type or term) free in
e or $\tau$ appears in $\Gamma$, and the symbols in $\Gamma$ are unique.
Similarly, every label free in e appears in $\Delta$.

To enforce that jumps are not used as side effects, $\Delta$ is reset in
every premise for a subterm whose runtime context is not statically
known. For example, consider

    join j x = RHS in f (jump j True τ)

Here the context in which the jump is invoked is not statically
known (in a lazy language it depends on how f uses its
argument), so it cannot be compiled to “adjust the stack and jump.”
So j is not a valid join point. We exclude such terms by resetting
$\Delta$ to $\varepsilon$ when typechecking the argument in rule APP.
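Evaluation contexts and stacks

\[
\begin{array}{rcll}
E & ::= & \Box \mid F[E] & \text{Evaluation contexts} \\
s & ::= & \varepsilon \mid F : s & \text{Stacks}
\end{array}
\]

Tail contexts

\[
\begin{array}{rcll}
L & ::= & \Box & \text{Empty unary context} \\
  & \mid & \mathsf{case}\ e\ \mathsf{of}\ \overrightarrow{p \to L} & \text{Case branches} \\
  & \mid & \mathsf{let}\ vb\ \mathsf{in}\ L & \text{Body of let} \\
  & \mid & \mathsf{join}\ j\ \vec{a}\ \overrightarrow{x{:}\sigma} = L\ \mathsf{in}\ L' & \text{Join point, body} \\
  & \mid & \mathsf{join\ rec}\ j\ \vec{a}\ \overrightarrow{x{:}\sigma} = L\ \mathsf{in}\ L' & \text{Rec join points, body}
\end{array}
\]

Miscellaneous

\[
\begin{array}{rcll}
C & \in & \text{General single-hole term contexts} & \\
\Sigma & ::= & \cdot \mid \Sigma,\ x{:}\sigma = v & \text{Heap} \\
c & ::= & \langle e;\ s;\ \Sigma \rangle & \text{Configuration}
\end{array}
\]

Figure 1: Syntax of System Fj.

Typing judgement: $\Gamma; \Delta \vdash e : \tau$

\[
\frac{(x{:}\tau) \in \Gamma}
     {\Gamma; \Delta \vdash x : \tau}\ \textsc{Var}
\]

\[
\frac{\mathit{typeof}(K) = \forall \vec{a}.\ \vec{\sigma} \to T\ \vec{a}
      \qquad \overrightarrow{\Gamma; \varepsilon \vdash u : \sigma\{\vec{\varphi}/\vec{a}\}}}
     {\Gamma; \Delta \vdash K\ \vec{\varphi}\ \vec{u} : T\ \vec{\varphi}}\ \textsc{Con}
\]

\[
\frac{\Gamma, x{:}\sigma; \varepsilon \vdash e : \tau}
     {\Gamma; \Delta \vdash \lambda x{:}\sigma.\ e : \sigma \to \tau}\ \textsc{Abs}
\qquad
\frac{\Gamma, a; \varepsilon \vdash e : \tau}
     {\Gamma; \Delta \vdash \Lambda a.\ e : \forall a.\ \tau}\ \textsc{TAbs}
\]

\[
\frac{\Gamma; \Delta \vdash e : \sigma \to \tau \qquad \Gamma; \varepsilon \vdash u : \sigma}
     {\Gamma; \Delta \vdash e\ u : \tau}\ \textsc{App}
\qquad
\frac{\Gamma; \Delta \vdash e : \forall a.\ \tau}
     {\Gamma; \Delta \vdash e\ \varphi : \tau\{\varphi/a\}}\ \textsc{TApp}
\]

\[
\frac{\Gamma; \varepsilon \vdash u : \sigma \qquad \Gamma, x{:}\sigma; \Delta \vdash e : \tau}
     {\Gamma; \Delta \vdash \mathsf{let}\ x{:}\sigma = u\ \mathsf{in}\ e : \tau}\ \textsc{VBind}
\]

\[
\frac{\Gamma, x{:}\sigma; \varepsilon \vdash u : \sigma \qquad \Gamma, x{:}\sigma; \Delta \vdash e : \tau}
     {\Gamma; \Delta \vdash \mathsf{let\ rec}\ x{:}\sigma = u\ \mathsf{in}\ e : \tau}\ \textsc{RVBind}
\]

\[
\frac{(j : \forall \vec{a}.\ \vec{\sigma} \to \forall r.\ r) \in \Delta
      \qquad \overrightarrow{\Gamma; \varepsilon \vdash u : \sigma\{\vec{\varphi}/\vec{a}\}}}
     {\Gamma; \Delta \vdash \mathsf{jump}\ j\ \vec{\varphi}\ \vec{u}\ \tau : \tau}\ \textsc{Jump}
\]

\[
\frac{\Gamma, \vec{a}, \overrightarrow{x{:}\sigma}; \Delta \vdash u : \tau
      \qquad \Gamma; \Delta, (j : \forall \vec{a}.\ \vec{\sigma} \to \forall r.\ r) \vdash e : \tau}
     {\Gamma; \Delta \vdash \mathsf{join}\ j\ \vec{a}\ \overrightarrow{x{:}\sigma} = u\ \mathsf{in}\ e : \tau}\ \textsc{JBind}
\]

\[
\frac{\Gamma, \vec{a}, \overrightarrow{x{:}\sigma}; \Delta, (j : \forall \vec{a}.\ \vec{\sigma} \to \forall r.\ r) \vdash u : \tau
      \qquad \Gamma; \Delta, (j : \forall \vec{a}.\ \vec{\sigma} \to \forall r.\ r) \vdash e : \tau}
     {\Gamma; \Delta \vdash \mathsf{join\ rec}\ j\ \vec{a}\ \overrightarrow{x{:}\sigma} = u\ \mathsf{in}\ e : \tau}\ \textsc{RJBind}
\]

\[
\frac{\Gamma; \Delta \vdash e : T\ \vec{\varphi}
      \qquad \overrightarrow{\mathit{typeof}(K) = \forall \vec{a}.\ \vec{\sigma} \to T\ \vec{a}
      \qquad \vec{\nu} = \vec{\sigma}\{\vec{\varphi}/\vec{a}\}
      \qquad \Gamma, \overrightarrow{x{:}\nu}; \Delta \vdash u : \tau}
      \qquad \mathit{ctors}(T) = \{\vec{K}\}}
     {\Gamma; \Delta \vdash \mathsf{case}\ e\ \mathsf{of}\ \overrightarrow{K\ \overrightarrow{x{:}\nu} \to u} : \tau}\ \textsc{Case}
\]

Figure 2: Type system for System Fj.

Nevertheless, the typing of join points is a little bit more flexible
than you might suspect. Consider this expression:

    ( join j x = RHS
      in case v of A → jump j True C2C
                   B → jump j False C2C
                   C → λc.c ) 'x'

where C2C = Char → Char. This is certainly well typed. A
valid transformation is to move the application to 'x' into both the
body and the right hand side of the join, thus:

    join j x = RHS 'x'
    in ( case v of A → jump j True C2C
                   B → jump j False C2C
                   C → λc.c ) 'x'

Now we can move the application into the branches:

    join j x = RHS 'x'
    in case v of A → (jump j True C2C) 'x'
                 B → (jump j False C2C) 'x'
                 C → (λc.c) 'x'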

Should this be well typed? The jumps to j are not exactly tail
calls, but they can (and indeed must) discard their context (here
the application to 'x') and resume execution at j. We will see
shortly how this program can be further transformed to remove
the redundant applications to 'x', but the point here is that this
intermediate program is still well typed, as reflected by the fact that
$\Delta$ is not reset in the function part of an application (rule APP).
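
The asymmetry between the function part and the argument part of an application can be made concrete with a small checker. The sketch below is an assumption-laden simplification, not the paper's type system: it is monomorphic, each join point takes exactly one parameter, and the names Ty, Expr, check, bad, and good are hypothetical. It keeps the one feature under discussion, namely where the label environment is reset.

```haskell
data Ty = TInt | TBool | TFun Ty Ty
  deriving (Eq, Show)

data Expr
  = Var String
  | IntLit Int
  | BoolLit Bool
  | Lam String Ty Expr               -- \(x : s). e
  | App Expr Expr                    -- e u
  | Join String String Ty Expr Expr  -- join j (x : s) = u in e
  | Jump String Expr Ty              -- jump j u t
  deriving Show

type Gamma = [(String, Ty)]  -- value variables
type Delta = [(String, Ty)]  -- join-point labels with their parameter type

check :: Gamma -> Delta -> Expr -> Maybe Ty
check g _ (Var x)     = lookup x g
check _ _ (IntLit _)  = Just TInt
check _ _ (BoolLit _) = Just TBool
check g _ (Lam x s e) = TFun s <$> check ((x, s) : g) [] e  -- labels reset
check g d (App e u)   = do
  TFun s t <- check g d e      -- function part keeps the labels
  s'       <- check g [] u     -- argument part resets them
  if s == s' then Just t else Nothing
check g d (Join j x s u e) = do
  t  <- check ((x, s) : g) d u
  t' <- check g ((j, s) : d) e
  if t == t' then Just t else Nothing
check g d (Jump j u t) = do
  s  <- lookup j d             -- fails if j is not in scope as a label
  s' <- check g [] u
  if s == s' then Just t else Nothing

-- join j x = x in f (jump j True Int): rejected, the jump is an argument.
bad :: Expr
bad = Join "j" "x" TBool (Var "x")
        (App (Var "f") (Jump "j" (BoolLit True) TInt))

-- join j x = x in (jump j True (Int -> Bool)) 3: accepted.
good :: Expr
good = Join "j" "x" TBool (Var "x")
         (App (Jump "j" (BoolLit True) (TFun TInt TBool)) (IntLit 3))
```

With `f` bound to a function from Int to Bool, `check` rejects `bad` and accepts `good` at type Bool, mirroring the two examples in the text.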

The types given to join points themselves deserve some attention.
A join point that binds type variables $\vec{a}$ and value arguments
of types $\vec{\sigma}$ is given the type
$\forall \vec{a}.\ \vec{\sigma} \to \forall r.\ r$ (rule JBIND). The
return type indicated, namely $\forall r.\ r$, is often written $\bot$, and it
indicates a non-returning function: a function which does not
actually return can be safely given any return value. This is similar to
how Haskell’s error function has type $\forall a.\ \mathit{String} \to a$. We have
merely moved the universal quantification to the end for
consistency with the join syntax, which does not (and must not²) bind
this “return-type parameter.”

So a join point’s type does not reflect the value of its body, and
a jump can have any type whatsoever. What then keeps a join point
from returning arbitrary values? It is the JBIND rule (or its recursive
variant) that checks the right hand side of the join point, making
sure its type is the same as that of the entire join expression. Thus we
cannot have

    join j = "Gotcha!" in if b then jump j Int else 4

because j returns a String but the body of the join returns an Int.
In short, the burden of typechecking has moved: whereas a function
can be declared to return any type but can only be invoked in certain
contexts, a join point can be invoked in any context but can only
return a certain type.

Finally, the reader may wonder why join points are polymorphic
(apart from the result type). In Fj as presented here, we could
manage with monomorphic join points, but they become absolutely
necessary when we add data constructors that bind existential type
variables. We omitted existentials from this paper for simplicity,
but they are very important in practice and GHC certainly supports
them.

² When we introduce the abort axiom (Sec. 3), it will need to change this
type argument arbitrarily, which it can only safely do if the type is never
actually used in the other parameters.

Figure 3: Call-by-name operational semantics for System Fj (rules
bind, look, case, jump, and ans, among others).

Operational semantics We give System Fj an operational
semantics (Fig. 3) in the style of an abstract machine. A configuration
of the machine is a triple $\langle e; s; \Sigma \rangle$ consisting of an expression
e which is the current focus of execution; a stack s representing
the current evaluation context (including join-point bindings); and
a heap $\Sigma$ of value bindings. The stack is a list of frames, each of
which is an argument to apply, a case analysis to perform, or a
bound join point (or recursive group). Each frame is moved to the
stack via the push rule. Most of the rules are quite conventional.
We describe only call-by-name evaluation here, as rule look shows;
switching to call-by-need by pushing an update frame is absolutely
standard.

Note that only value bindings are put in the heap. Join points
are stack-allocated in a frame: they represent mere code blocks, not
first-class function closures. As expected, a jump throws away its
context (the jump rule); it does so by popping all the frames from
the stack to the binding (as usual, ++ stands for the concatenation
of two stacks):

\[
\begin{array}{cl}
 & \langle \mathsf{join}\ j\ x = x\ \mathsf{in}\ \mathsf{case}\ (\mathsf{jump}\ j\ 2\ (\mathit{Int} \to \mathit{Bool}))\ 3\ \mathsf{of}\ \ldots;\ \varepsilon;\ \varepsilon \rangle \\
\mapsto^{*} & \langle \mathsf{jump}\ j\ 2\ (\mathit{Int} \to \mathit{Bool});\ \Box\ 3 : \mathsf{case}\ \Box\ \mathsf{of}\ \ldots : \mathsf{join}\ j\ x = x\ \mathsf{in}\ \Box : \varepsilon;\ \varepsilon \rangle \\
\mapsto & \langle x;\ \mathsf{join}\ j\ x = x\ \mathsf{in}\ \Box : \varepsilon;\ x = 2 \rangle
\end{array}
\]

Here three frames are pushed onto the stack: the join-point binding,
the case analysis, and finally the application of the jump to 3. Then
the jump is evaluated, popping the latter two frames, replacing the
term with the one from the join point, and binding the argument.

The ans rule removes a join-point binding from the context
once an answer A (see Fig. 1) is computed; note that a well-typed
answer cannot contain a jump, so at that point the binding must be
dead code. Continuing our example:

\[
\langle x;\ \mathsf{join}\ j\ x = x\ \mathsf{in}\ \Box : \varepsilon;\ x = 2 \rangle
\;\mapsto^{*}\; \langle 2;\ \varepsilon;\ x = 2 \rangle
\]
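
The behaviour of the jump and ans rules can be sketched as a tiny interpreter. The code below is a simplified, untyped model written for illustration only: it omits case frames, type arguments, and recursive join points, performs no renaming of bound variables, and all names (Expr, Frame, step, run, example) are hypothetical.

```haskell
data Expr
  = Lit Int
  | Var String
  | Lam String Expr
  | App Expr Expr
  | Join String String Expr Expr   -- join j x = u in e
  | Jump String Expr               -- jump j v (type argument omitted)
  deriving (Eq, Show)

data Frame
  = ArgF Expr                      -- [] v
  | JoinF String String Expr       -- join j x = u in []
  deriving (Eq, Show)

type Stack  = [Frame]
type Heap   = [(String, Expr)]
type Config = (Expr, Stack, Heap)

isAnswer :: Expr -> Bool
isAnswer (Lit _)   = True
isAnswer (Lam _ _) = True
isAnswer _         = False

binds :: String -> Frame -> Bool
binds j (JoinF j' _ _) = j == j'
binds _ _              = False

step :: Config -> Maybe Config
step (App e v, s, h)          = Just (e, ArgF v : s, h)        -- push
step (Join j x u e, s, h)     = Just (e, JoinF j x u : s, h)   -- push
step (Lam x e, ArgF v : s, h) = Just (e, s, (x, v) : h)        -- beta
step (Var x, s, h)            = (\v -> (v, s, h)) <$> lookup x h  -- look
step (Jump j v, s, h)         =                                -- jump
  case dropWhile (not . binds j) s of   -- pop frames above the binding
    s'@(JoinF _ x u : _) -> Just (u, s', (x, v) : h)
    _                    -> Nothing
step (a, JoinF{} : s, h)
  | isAnswer a                = Just (a, s, h)                 -- ans
step _                        = Nothing

-- Run until no rule applies.
run :: Expr -> Config
run e = go (e, [], [])
  where go c = maybe c go (step c)

-- join j x = x in (jump j 2) 3
example :: Expr
example = Join "j" "x" (Var "x") (App (Jump "j" (Lit 2)) (Lit 3))
```

Running `example` pushes the join frame and the argument frame, then the jump discards the argument frame, binds x to 2, and the machine finishes with the answer 2 on an empty stack.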

Optimizing transformations The operational semantics operates
on closed configurations. An optimizing compiler, by contrast,
must transform open terms. To describe possible optimizations,
then, we separately develop a sound equational theory (Fig. 4),
which lays down the “rules of the game” by which the optimizer
is allowed to work. It is up to the optimizer to determine how
to apply the rules to rewrite code. All the axioms carry implicit
scoping restrictions to avoid free-variable capture. (For example,
drop requires that nothing bound by vb occurs free in e.)

The $\beta$, $\beta_\tau$, and case axioms are analogues of the
similarly-named rules in the operational semantics. Since there is no heap, $\beta$
and case create let expressions instead. Compile-time substitution,
or inlining, is performed for values by inline and for join points
by jinline. If a binding is inlined exhaustively, it becomes dead
code and can be eliminated by the drop or jdrop axiom. Values
may be substituted anywhere,³ which we indicate using a general
single-hole context C in inline. Inlining of join points is a bit more
delicate. A jump indicates both that we should execute the join
point and that we should throw out the evaluation context up to
the join point’s declaration. Simply copying the body accomplishes
the former but not the latter. For example:

    join j (x : Int) = x + 1 in (jump j 2 (Int → Int)) 3

If we naively inline j here, we end up with the ill-typed term:

    join j (x : Int) = x + 1 in (2 + 1) 3

Inlining is safe, however, if the jump is a tail call, since then there is
no extra evaluation context to throw away. To specify the allowable
places to inline a join point, then, we use a syntactic notion called a
tail context. A tail context L (see Fig. 1) is a multi-hole context
describing the places where a term may return to its evaluation
context. Since $\Box\ 3$ is not a tail context, the jinline axiom fails for
the above term.

The casefloat, float, and jfloat axioms perform commuting
conversions. Of the three, jfloat is novel. It does the transformation
we wanted to perform in Sec. 2 to avoid destroying a join point. It
relies on a simple meta-syntactic function $E[\cdot]$ to push E into a
join-point binding:

\[
\begin{array}{rcl}
E[j\ \vec{a}\ \vec{x} = u] & = & (j\ \vec{a}\ \vec{x} = E[u]) \\
E[\mathsf{rec}\ j\ \vec{a}\ \vec{x} = u] & = & (\mathsf{rec}\ j\ \vec{a}\ \vec{x} = E[u])
\end{array}
\]

Consider again the example at the beginning of Sec. 2. With our
new syntax, we can write it as:

    case ( join j x = BIG
           in case v of A → jump j 1 Bool
                        B → jump j 2 Bool
                        C → True ) of
      { True → False; False → True }

We can use jfloat to move the outer case into both the right hand
side of the join binding and into its body; use casefloat to move
the outer case into the branches of the inner case; use abort to
discard the outer case where it scrutinizes a jump; and use case
to simplify the C alternative. The result is just what we want:

    join j x = case BIG of { True → False; False → True }
    in case v of A → jump j 1 Bool
                 B → jump j 2 Bool
                 C → False

³ For brevity, we have omitted rules allowing inlining a recursive definition
into the definition itself (or another definition in the same recursive group).

The commute axiom The left-hand sides of axioms float, jfloat,
and casefloat enumerate the forms of a tail context. That suggests
that the three axioms are all instances of a single more general (yet
equivalent) form:

\[
E[L[\vec{e}]] = L[\overrightarrow{E[e]}] \qquad (\mathit{commute})
\]

To apply commute (forward) is to move the evaluation context
into each hole of the tail context. Since the tail context describes
the places where something is returned to the evaluation context,
commute “substitutes” the context into the places where it is
invoked.⁴

We can also derive new axioms succinctly using tail contexts.
For example, our commuting conversions as written do quite a
bit of code duplication by copying E arbitrarily many times (into
each branch of a case and each join point). Of course, in a real
implementation, we would prefer not to do this, so instead we might
use a different axiom:

\[
E[L[\vec{e}] : \tau] = \mathsf{join}\ j\ x = E[x]\ \mathsf{in}\ L[\overrightarrow{\mathsf{jump}\ j\ e\ \tau}]
\]

This can be derived from commute by first applying jdrop and
jinline backward.
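
The combined effect of commute and abort can be sketched as a single traversal over a small syntax tree. This is an illustrative model under stated assumptions, not the GHC implementation: the evaluation context is represented as a Haskell function, types and variable capture are ignored, and the names Expr, commute, notCtx, and example are hypothetical.

```haskell
data Expr
  = Var String
  | Lit Int
  | Con String                       -- nullary constructor, e.g. True
  | Case Expr [(String, Expr)]       -- case e of { K -> u; ... }
  | Let String Expr Expr             -- let x = u in e
  | Join String [String] Expr Expr   -- join j xs = u in e
  | Jump String [Expr]               -- jump j es (type argument omitted)
  deriving (Eq, Show)

-- Push an evaluation context into every tail position:
-- casefloat, float, and jfloat descend; a jump discards the context (abort).
commute :: (Expr -> Expr) -> Expr -> Expr
commute ctx (Case e alts)   = Case e [ (k, commute ctx u) | (k, u) <- alts ]
commute ctx (Let x u e)     = Let x u (commute ctx e)
commute ctx (Join j xs u e) = Join j xs (commute ctx u) (commute ctx e)
commute _   (Jump j es)     = Jump j es
commute ctx e               = ctx e

-- The outer context of the running example: case [] of { True -> False; ... }
notCtx :: Expr -> Expr
notCtx e = Case e [("True", Con "False"), ("False", Con "True")]

-- join j x = BIG in case v of { A -> jump j 1; B -> jump j 2; C -> True }
example :: Expr
example =
  Join "j" ["x"] (Var "BIG")
    (Case (Var "v") [ ("A", Jump "j" [Lit 1])
                    , ("B", Jump "j" [Lit 2])
                    , ("C", Con "True") ])
```

Applying `commute notCtx example` wraps BIG and the C branch in the outer case and leaves both jumps untouched. The sketch does not implement the case axiom, so the C branch is left as a case on True rather than simplified to False.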

4. Contification: inferring join points 

Not all join points originate from commuting conversions. Though
the source language doesn’t have join points or jumps, many
let-bound functions can be converted to join points without changing
the meaning of the program. In particular, if every call to a given
function is a tail call, and we turn the calls into jumps, then
whenever one of the jumps is executed, there will be nothing to drop
from the evaluation context (the s' in the jump rule will be empty).

The process is a form of contification (or continuation
demotion), which we describe in Fig. 5, where fv(e) means the
set of free variables of e (and similarly fv(L) for tail contexts), and
dom($\rho$) means the domain of the environment $\rho$ (to be described
shortly).

The non-recursive version, contify, attempts to decompose the
body of the let (i.e., the scope of f) into a tail context L and
its arguments, where the arguments contain all the occurrences of
f, then attempts to run the special partial function tail on each
argument to the tail context. This function will only succeed if there
are no non-tail calls to f.

The tail function takes an environment $\rho$ mapping applications
of contifiable variables f to jumps to corresponding join points j.
For each expression that matches the form of a saturated call to
such an f, then, tail turns the call into a jump to its j, provided that
none of the arguments to the function contains a free occurrence
of a variable being contified; an occurrence in argument position
is disallowed by the typing rules. For any other expression, tail
changes nothing but does check that no variable being contified
appears; otherwise, tail fails, causing the contify axiom not to
match.

There is one last proviso in the contify and contifyrec axioms,
which is that the body of each function to be contified must have
the same type as the body of the let. This can fail to occur if some
function f is polymorphic in its return type [8].
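
The side condition on occurrences can be sketched as a simple syntactic check. The code below is a hypothetical simplification rather than the algorithm of Fig. 5: it covers only case and non-recursive let as tail contexts, ignores types and pattern-bound variables, and the names Expr, occurs, and onlyTailCalls are assumptions made for illustration.

```haskell
data Expr
  = Var String
  | Lit Int
  | App Expr [Expr]              -- application to a list of arguments
  | Lam [String] Expr
  | Case Expr [(String, Expr)]   -- alternatives bind no variables here
  | Let String Expr Expr         -- non-recursive let
  deriving Show

-- Does f occur free in the expression?
occurs :: String -> Expr -> Bool
occurs f (Var x)       = f == x
occurs _ (Lit _)       = False
occurs f (App e es)    = occurs f e || any (occurs f) es
occurs f (Lam xs e)    = f `notElem` xs && occurs f e
occurs f (Case e alts) = occurs f e || any (occurs f . snd) alts
occurs f (Let x u e)   = occurs f u || (x /= f && occurs f e)

-- True when every occurrence of f is a call with exactly n arguments
-- in tail position, and no argument mentions f.
onlyTailCalls :: String -> Int -> Expr -> Bool
onlyTailCalls f n (Case e alts) =
  not (occurs f e) && all (onlyTailCalls f n . snd) alts
onlyTailCalls f n (Let x u e)
  | x == f    = not (occurs f u)          -- f is shadowed in the body
  | otherwise = not (occurs f u) && onlyTailCalls f n e
onlyTailCalls f n (App (Var g) es)
  | g == f    = length es == n && not (any (occurs f) es)
onlyTailCalls f _ e = not (occurs f e)    -- any other form: f must not appear
```

For instance, a body of the form `case v of { A -> f 1; B -> f 2; C -> 0 }` passes the check for `f` with one argument, whereas `g (f 1)` does not, because the call to `f` sits in argument position.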

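⁴ In fact, from a CPS standpoint, commute is precisely a substitution
operation.

\[
\begin{array}{rcll}
(\lambda x{:}\sigma.\ e)\ v & = & \mathsf{let}\ x{:}\sigma = v\ \mathsf{in}\ e & (\beta) \\
(\Lambda a.\ e)\ \varphi & = & e\{\varphi/a\} & (\beta_\tau) \\
\mathsf{let}\ vb\ \mathsf{in}\ C[x] & = & \mathsf{let}\ vb\ \mathsf{in}\ C[v] & (\mathit{inline}) \\
 & & \quad \text{if } (x{:}\sigma = v) \in vb & \\
\mathsf{let}\ vb\ \mathsf{in}\ e & = & e & (\mathit{drop}) \\
\mathsf{join}\ jb\ \mathsf{in}\ L[\vec{e},\ \mathsf{jump}\ j\ \vec{\varphi}\ \vec{v}\ \tau,\ \vec{e}'] & = & \mathsf{join}\ jb\ \mathsf{in}\ L[\vec{e},\ \mathsf{let}\ \overrightarrow{x{:}\sigma = v}\ \mathsf{in}\ u\{\vec{\varphi}/\vec{a}\},\ \vec{e}'] & (\mathit{jinline}) \\
 & & \quad \text{if } (j\ \vec{a}\ \overrightarrow{x{:}\sigma} = u) \in jb & \\
\mathsf{join}\ jb\ \mathsf{in}\ e & = & e & (\mathit{jdrop}) \\
\mathsf{case}\ K\ \vec{\varphi}\ \vec{v}\ \mathsf{of}\ \overrightarrow{alt} & = & \mathsf{let}\ \overrightarrow{x{:}\sigma = v}\ \mathsf{in}\ e & (\mathit{case}) \\
 & & \quad \text{if } (K\ \overrightarrow{x{:}\sigma} \to e) \in \overrightarrow{alt} & \\
E[\mathsf{case}\ e\ \mathsf{of}\ \overrightarrow{K\ \vec{x} \to u}] & = & \mathsf{case}\ e\ \mathsf{of}\ \overrightarrow{K\ \vec{x} \to E[u]} & (\mathit{casefloat}) \\
E[\mathsf{let}\ vb\ \mathsf{in}\ e] & = & \mathsf{let}\ vb\ \mathsf{in}\ E[e] & (\mathit{float}) \\
E[\mathsf{join}\ jb\ \mathsf{in}\ e] & = & \mathsf{join}\ E[jb]\ \mathsf{in}\ E[e] & (\mathit{jfloat}) \\
E[\mathsf{jump}\ j\ \vec{\varphi}\ \vec{e}\ \tau] : \tau' & = & \mathsf{jump}\ j\ \vec{\varphi}\ \vec{e}\ \tau' & (\mathit{abort})
\end{array}
\]

Figure 4: Common optimizations for System Fj.

\[
\begin{array}{l}
\mathsf{let}\ f = \Lambda \vec{a}.\ \lambda \vec{x}.\ u\ \mathsf{in}\ L[\vec{e}] : \tau
  \;=\; \mathsf{join}\ j\ \vec{a}\ \vec{x} = u\ \mathsf{in}\ L[\overrightarrow{\mathit{tail}_\rho(e)}]
  \qquad (\mathit{contify}) \\
\quad \text{if } \rho(f\ \vec{a}\ \vec{x}) = \mathsf{jump}\ j\ \vec{a}\ \vec{x}\ \tau
  \text{ and } f \notin \mathit{fv}(L),\ u : \tau \\[1ex]
\mathsf{let\ rec}\ f = \Lambda \vec{a}.\ \lambda \vec{x}.\ L[\vec{u}]\ \mathsf{in}\ L'[\vec{e}] : \tau
  \;=\; \mathsf{join\ rec}\ j\ \vec{a}\ \vec{x} = L[\overrightarrow{\mathit{tail}_\rho(u)}]\ \mathsf{in}\ L'[\overrightarrow{\mathit{tail}_\rho(e)}]
  \qquad (\mathit{contifyrec}) \\
\quad \text{if } \rho(f\ \vec{a}\ \vec{x}) = \mathsf{jump}\ j\ \vec{a}\ \vec{x}\ \tau
  \text{ and } f \notin \mathit{fv}(L),\ f \notin \mathit{fv}(L'),\ L[\vec{u}] : \tau
\end{array}
\]

\[
\begin{array}{rcll}
\mathit{tail}_\rho(f\ \vec{\sigma}\ \vec{u}) & = & e\{\vec{\sigma}/\vec{a}\}\{\vec{u}/\vec{x}\}
  & \text{if } \rho(f\ \vec{a}\ \vec{x}) = e \text{ and } \mathit{dom}(\rho) \cap \mathit{fv}(\vec{u}) = \emptyset \\
\mathit{tail}_\rho(e) & = & e & \text{if } \mathit{dom}(\rho) \cap \mathit{fv}(e) = \emptyset \\
\mathit{tail}_\rho(e) & = & \text{undefined} & \text{otherwise}
\end{array}
\]

Figure 5: Contification as a source-to-source transformation.

Finding bindings to which contify or contifyrec will apply
is not difficult. Our implementation is essentially a free-variable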


analysis that also tracks whether each free variable has appeared 
only in the holes of tail contexts. This is much simpler than previous 
contification algorithms because we only look for tail calls. We 
invite the reader to compare to ODD or to Sec. 5 of GU, which both 
allow for more general calls to be dealt with. Yet we claim that, 
in concert with the simplifier and the Float In pass, our algorithm 
covers most of the same ground. To demonstrate, a convenient 
point of comparison is the local CPS transformation in Moby [23],
which produces mutually tail-recursive functions to improve code
generation in much the same way GHC does. Note that Moby uses
a direct-style intermediate representation, though its contification
pass is expressed in terms of a CPS transform.

In essence, the final effect of Moby’s local CPS transform is to
turn

    let f x = ...
    in E[... f y ... f z ...]

(where the calls to f are tail calls within E) into

    let { j x = E[x]; f x = j <rhs> }
    in ... f y ... f z ...

where the tail calls to f are now compiled as efficient jumps. Note
that f now matches the contify axiom, but it did not before because
of the E in the way. Nonetheless, our extended GHC achieves the
same effect as Moby, only in stages. Starting with:

    let f x = rhs in E[... f y ... f z ...]

First, applying float from right to left floats f inward:

    E[let f x = rhs in ... f y ... f z ...]

Next, contify applies, since the calls to f are now tail calls:

    E[join f x = rhs in ... jump f y τ ... jump f z τ ...]

And now jfloat pushes E into the join point f and the body:

    join f x = E[rhs] in ... E[jump f y τ] ... E[jump f z τ] ...

From here, abort removes E from the jumps, and we can abstract
E by running jdrop and jinline backward:

    join { j x = E[x]; f x = jump j rhs τ } in ... f y ... f z ...

Thus we achieve the same result without any extra effort.⁵

Naturally, contification is more routine and convenient in
CPS-based compilers. The ability to handle an intervening
context comes nearly “for free” since contexts already have names.
Notably, it is still possible to name contexts in direct style (the
Moby paper [23] does so using labelled expressions), so it is only a
matter of convenience, not feasibility.


5. Recursive join points and fusion 

We have mentioned, without stressing the point, that join points 
can be recursive. We have also shown that it is rather easy to 
identify let-bindings that can be re-expressed (more efficiently) 
as join points. To our complete surprise, we discovered that the 
combination of these two features allowed us to solve a long- 
standing problem with stream fusion. 


5 The parts of this sequence not specifically to do with join points were 
already implemented before in GHC: The Float In pass applies float in 
reverse, and the Simplifier regularly creates join points to share evaluation 
contexts (except that previously they were ordinary let bindings). 
Recursive join points Consider this program, which finds the first
element of a list that satisfies a predicate p:

find = Λa.λ(p : a → Bool)(xs : [a]).
         let go xs = case xs of
                       x : xs' → if p x then Just x
                                 else go xs'
                       []      → Nothing
         in go xs

Programmers quite often write loops like this, with a local defini- 
tion for go, perhaps to allow find to be inlined at a call site. Our 
first observation is this: go is a (recursive) join point! The
contification transformation of Sec. [4] will identify go as a join
point, and will transform the let to a join, and each call to go into
a jump. Moreover, the transformed function is much more efficient
because there
is no longer a heap-allocated closure for go. 
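
For readers more used to source Haskell than to Core, the same loop might be written as follows. This is only a sketch: the name find' is chosen here to avoid clashing with Data.List.find, and whether go actually becomes a join point depends on the GHC version and optimization level in use.

```haskell
-- Source-level counterpart of the Core program above.
-- find' is a hypothetical name, primed to avoid Data.List.find.
find' :: (a -> Bool) -> [a] -> Maybe a
find' p xs0 = go xs0
  where
    -- Every call to go is a saturated tail call, which is the shape
    -- that the contification analysis looks for.
    go (x : xs)
      | p x       = Just x
      | otherwise = go xs
    go []         = Nothing
```

For instance, find' even [1, 3, 4, 5] evaluates to Just 4, and find' even [1, 3] to Nothing.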

But it gets better! Because go is a join point, it can participate in 
a commuting conversion. Suppose, for example, that find is called 
from any like this: 

any = Λa.λ(p : a → Bool)(xs : [a]).
        case find p xs of Just _  → True
                          Nothing → False


The call to find can be inlined: 

any = Λa.λ(p : a → Bool)(xs : [a]).
        case (join go xs = case xs of
                             x : xs' → if p x then Just x
                                       else jump go xs' (Maybe a)
                             []      → Nothing
              in jump go xs (Maybe a))
        of {Just _ → True; Nothing → False}


Now, we have a case scrutinizing a join so we can apply axiom
jfloat from Figure [4]. After some easy further transformations, we
get 

any = Λa.λ(p : a → Bool)(xs : [a]).
        join go xs = case xs of
                       x : xs' → if p x then True
                                 else jump go xs' Bool
                       []      → False
        in jump go xs Bool


Look carefully at what has happened here: the consumer (any) of 
a recursive loop (go) has moved all the way to the return point of 
the loop, so that we were able to cancel the case in the consumer 
with the data constructor returned at the conclusion of the loop. 
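
The before and after programs can be mirrored in source Haskell to check that they agree. The sketch below is an illustration written by hand, not compiler output; find', anyViaFind, and anyFused are hypothetical names.

```haskell
-- The loop from the previous example (hypothetical name find').
find' :: (a -> Bool) -> [a] -> Maybe a
find' p = go
  where
    go (x : xs)
      | p x       = Just x
      | otherwise = go xs
    go []         = Nothing

-- Before: the consumer scrutinizes the result of the whole loop.
anyViaFind :: (a -> Bool) -> [a] -> Bool
anyViaFind p xs = case find' p xs of
  Just _  -> True
  Nothing -> False

-- After: the case has been pushed to the loop's return points, so no
-- Maybe value is ever constructed.
anyFused :: (a -> Bool) -> [a] -> Bool
anyFused p = go
  where
    go (x : xs)
      | p x       = True
      | otherwise = go xs
    go []         = False
```

Both functions return the same Bool for every predicate and finite list; only the second avoids building and immediately inspecting a Just or Nothing.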


Stream fusion It turns out that this new ability to move a con- 
sumer all the way to the return points of a tail-recursive loop has 
direct implications for a very widely used transformation: stream 
fusion. The key idea of stream fusion is to represent a list (or array, 
or other sequence) by a pair of a state and a stepper function, thus:⁶

data Stream a where
  MkStream :: s -> (s -> Step s a) -> Stream a

There are two competing approaches to the Step type. In
unfoldr/destroy fusion, first described by Svenningsson [26], we have:

data Step s a = Done | Yield s a

Hence a stepper function takes an incoming state and either yields 
an element and a new state or signals the end. 

Now a pipeline of list processors can be rewritten as a pipeline 
of stepper functions, each of which produces and consumes ele- 
ments one by one. A typical stepper function for a stream trans- 
former looks like: 


6 Note that Stream is an existential type, so as to abstract the internal state 
type s as an implementation detail of the stream. 


next s = case <incoming step> of
  Yield s' a -> <process element>
  Done       -> <process end of stream>

When composed together and inlined, the stepper functions become 
a nest of cases, each scrutinizing the output of the previous stepper. 
It is crucial for performance that each Yield or Done expression be 
matched to a case, much as we did with Just and Nothing in the 
example that began Sec. [2]. Fortunately, case-of-case and the other
commuting conversions that GHC performs are usually up to the 
task. 
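
As a concrete instance of the stepper pattern, here is a minimal sketch of a two-constructor stream library. It assumes the ExistentialQuantification syntax rather than the GADT-style declaration shown above, and mapS is a hypothetical name for the stream version of map.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- Two-constructor Step type, with the state before the element.
data Step s a = Done | Yield s a

-- The state type s is hidden from consumers of the stream.
data Stream a = forall s. MkStream s (s -> Step s a)

-- Convert a list to a stream; the state is the remaining list.
stream :: [a] -> Stream a
stream xs0 = MkStream xs0 next
  where
    next []       = Done
    next (x : xs) = Yield xs x

-- Convert a stream back to a list by running its stepper.
unstream :: Stream a -> [a]
unstream (MkStream s0 next) = go s0
  where
    go s = case next s of
      Done       -> []
      Yield s' a -> a : go s'

-- A stream transformer: its stepper scrutinizes the incoming step.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (MkStream s0 next) = MkStream s0 next'
  where
    next' s = case next s of
      Yield s' a -> Yield s' (f a)
      Done       -> Done
```

Composing transformers such as mapS f . mapS g nests one stepper's case inside another's, which is the chain of cases that the commuting conversions are meant to collapse.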

Alas, this approach requires a recursive stepper function when 
implementing filter, which must loop over incoming elements 
until it finds a match. This breaks up the chain of cases by putting 
a loop in the way, much as our any above becomes a case on a 
loop. Hence until now, recursive stepper functions have been
unfusible. Coutts et al. [6] suggested adding a Skip constructor to
Step, thus:

data Step s a = Done | Yield s a | Skip s

Now the stepper function can say to update the state and call again, 
obviating the need for a loop of its own. This makes filter 
fusible, but it complicates everything else! Everything gets three 
cases instead of two, leading to more code and more runtime 
tests; and functions like zip that consume two lists become more 
complicated and less efficient. 

But with join points, just as with any, Svenningsson’s original 
Skip-less approach fuses just fine! Result: simpler code, less of it, 
and faster to execute. It’s a straight win. 
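
To make the Skip-less approach concrete, the following sketch writes filter with a recursive stepper over the two-constructor Step type. The name filterS is hypothetical, and the code only shows the shape of the stepper: whether the resulting pipeline fuses depends on the compiler treating the local loop as a join point.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data Step s a = Done | Yield s a

data Stream a = forall s. MkStream s (s -> Step s a)

stream :: [a] -> Stream a
stream xs0 = MkStream xs0 next
  where
    next []       = Done
    next (x : xs) = Yield xs x

unstream :: Stream a -> [a]
unstream (MkStream s0 next) = go s0
  where
    go s = case next s of
      Done       -> []
      Yield s' a -> a : go s'

-- Skip-less filter: the stepper loops until it finds an element that
-- satisfies p or the input ends. The recursive call to next' is a
-- tail call, so next' has the shape of a recursive join point.
filterS :: (a -> Bool) -> Stream a -> Stream a
filterS p (MkStream s0 next) = MkStream s0 next'
  where
    next' s = case next s of
      Yield s' a
        | p a       -> Yield s' a
        | otherwise -> next' s'
      Done          -> Done
```

With a Skip constructor, the otherwise branch would instead return Skip s' and every consumer would need a third case alternative.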

6. Metatheory of Fj 

Correctness and type safety The way to “run” a program on our 
abstract machine is to initialize the machine with an empty stack 
and an empty store. Type safety, then, says that once we start the 
machine, the program either runs forever or successfully returns an 
answer. 

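Proposition 1 (Type safety). If $\varepsilon; \varepsilon \vdash e : \tau$, then either:

1. The initial configuration $\langle e; \varepsilon; \varepsilon \rangle$ diverges, or

2. $\langle e; \varepsilon; \varepsilon \rangle \mapsto^{*} \langle A; \varepsilon; \Sigma \rangle$, for some store $\Sigma$ and answer $A$.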

To establish the correctness of our rewriting axioms, we first 
define a notion of observational equivalence. 

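Definition 2. Two terms $e$ and $e'$ are observationally equivalent,
written $e \cong e'$, if, given any stack $s$ and store $\Sigma$, either

• both $\langle e; s; \Sigma \rangle$ and $\langle e'; s; \Sigma \rangle$ diverge, or

• for some $\Sigma_1$, $A_1$, $\Sigma_2$, and $A_2$, $\langle e; s; \Sigma \rangle \mapsto^{*} \langle A_1; \varepsilon; \Sigma_1 \rangle$ and
$\langle e'; s; \Sigma \rangle \mapsto^{*} \langle A_2; \varepsilon; \Sigma_2 \rangle$.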

The equational theory is sound with respect to observational 
equivalence: 

Proposition 3. If $e = e'$, then $e \cong e'$.

Equivalence to System F The best way to be sure that Fj can 
be implemented without any headaches is to show that it is equiva- 
lent to GHC’s existing System F-based language. This would sug- 
gest that the introduction of join points does not allow us to write 
any new programs, only to implement existing programs more ef- 
ficiently. To prove the equivalence, we establish an erasure pro- 
cedure that removes all join points from an Fj term, leaving an 
equivalent System F term. 

To erase the join points, we want to apply the contify axiom 
(or its recursive variant) from right to left. However, we cannot 
necessarily do so immediately for each join point, since contify 
only applies when all invocations are in tail position. For example, 
we cannot de-contify j here: 

join j x = x + 1 in (jump j 1 (Int → Int)) 2

Simply rewriting the join point as a function and the jump as a
function call would change the meaning of the program; in fact, it
would not even be well-typed:

let f = λx. x + 1 in f 1 2

However, if we apply abort first:

join j x = x + 1 in jump j 1 Int 

Now the jump is a tail call, so contify applies. 

The abort axiom is not enough on its own, since the jump may 
be buried inside a tail context: 

(case b of
   True  → jump j 1 (Int → Int)
   False → jump j 3 (Int → Int)) 2

However, this can be handled by a commuting conversion:

join j x = x + 1 in case b of
                      True  → (jump j 1 (Int → Int)) 2
                      False → (jump j 3 (Int → Int)) 2

And now abort applies twice and j can be de-contified.

Lemma 4. For any well-typed term $e$, there is an $e'$ such that
$e' = e$ and every jump in $e'$ is in tail position.

By “tail position,” we mean one of the holes in a tail context 
that starts with the binding for the join point being called. In other 
words, given a term 

$\mathbf{join}\ j\ \vec{a}\ \vec{x} = u\ \mathbf{in}\ L[\vec{e}]$,

the terms $\vec{e}$ are in tail position for $j$.

The proof of Lemma [4] relies on the observation that the places
in a term that may contain free occurrences of labels are precisely
those appearing in the hole of either an evaluation or a tail context.
For example, the CASE typing rule propagates $\Delta$ into both the
scrutinee and the branches; note that case □ of alt is an evaluation
context and case e of p → □ is a tail context. But e □ is (in call-
by-name) neither an evaluation context nor a tail context, and APP
does not propagate $\Delta$ into the argument.

Thus any expression can be written as: 

$L[\vec{E}[\vec{L'}[\vec{E'}[\cdots[\vec{L^{(n)}}[\vec{E^{(n)}}[\vec{e}]]]\cdots]]]]$, (1)

which is to say a tree of tail contexts alternating with evaluation 
contexts, where all free occurrences of join points are at the leaves. 
By iterating commute and abort, we can flatten the tree, rewriting
(1) to say that any expression can be written $L[\vec{e}]$, where each
$e_i$ is a leaf from the tree in (1). Hence no $e_i$ can be expressed
as E[L[. . .]] for nontrivial, non-binding E and nontrivial L, and
every jump to a free occurrence of a label is some $e_i$. Let us
say a term in the above form is in commuting-normal form.⁸ By
commute and abort, every term has a commuting-normal form,
and by construction, every jump in a commuting-normal form is a
tail call. Thus every label can be decontified, and we have:

Theorem 5 (Erasure). For any closed, well-typed Fj term $e$, there
is a System F term $e'$ such that $e' = e$.

7. Join points in practice 

It is one thing to define a calculus, but quite another to use it in
a full-scale optimising compiler. In this section we report on our 
experience of doing so in GHC. 

Implementing join points in GHC We have implemented Sys- 
tem Fj as an extension to the Core language in GHC. Rather than 
adding two new data constructors for join and jump to the Core 

7 A join can be treated as either an evaluation context or a tail context; 
using commute to push a join inward is not necessarily helpful, however. 
8 ANF is simply commuting-normal form with named intermediate values. 


data type, we instead re-use ordinary let-bindings and function ap- 
plications, distinguishing join points only by a flag on the identifier 
itself. 

Thus, with no code changes, GHC treats join-point identifiers 
identically to other identifiers, and join-point bindings identically 
to ordinary let bindings. This is extremely convenient in practice. 
For example, all the code that deals with dropping dead bindings, 
inlining a binding that occurs just once, inlining a binding whose 
right-hand side is small, and so on, all works automatically for join 
points too. 

With the modified Core language in hand, we had three tasks. 
First, GHC has an internal typechecker, called Core Lint, that (op- 
tionally) checks the type-correctness of the intermediate program 
after each pass. We augmented Core Lint for Fj according to the 
rules of Fig. [2].

Second, we added a simple new contification analysis to identify 
let-bindings that can be converted into join points (see Sec. [4]).
Since the analysis is simple, we run it frequently, whenever the 
so-called occurrence analyzer runs. 

Finally, the new Core Lint forensically identified several ex- 
isting Core-to-Core passes that were “destroying” join points (see 
Sec. [2]). Destroying a join point de-optimizes the program, so it is
wonderful now to have a way to nail such problems at their source. 
Moreover, once Lint flagged a problem, it was never difficult to al- 
ter the Core-to-Core transformation to make it preserve join points. 
Here are some of the specifics about particular passes: 

The Simplifier is a sort of partial evaluator responsible for many 
local transformations, including commuting conversions and
inlining. The Simplifier is implemented as a tail-recursive
traversal that builds up a representation of the evaluation
context as it goes; as such, implementing the jfloat and abort
axioms (Sec. [3]) requires only two new behaviors:

• (jfloat) When traversing a join-point binding, copy the eval- 
uation context into the right-hand side. 

• (abort) When traversing a jump, throw away the evaluation 
context. 

The Float Out pass moves let bindings outwards [20]. Moving a
join binding outwards, however, risks destroying the join point, 
so we modified Float Out to leave join bindings alone in most 
cases. 

The Float In pass moves let bindings inwards. It too can
destroy join points by un-saturating them. For example, given
let j x y = ... in j 1 2, the Float In pass wants to narrow
j’s scope as much as possible: (let j x y = ... in j) 1 2.
We modified Float In so that it never un-saturates a join point.

Strictness analysis is as useful for join points as it is for ordinary
let bindings, so it is convenient that join bindings are, by 
default, treated identically to ordinary let bindings. In GHC, 
the results of strictness analysis are exploited by the so-called 
worker/wrapper transform [12, 19]. We needed to modify this
transform so that the generated worker and wrapper are both
join points. We found that GHC’s constructed product result
(CPR) analysis [3] caused the wrapper to invoke the worker
inside a case expression, thus preventing the worker from 
being a join point. We simply disable CPR analysis for join 
points; it turns out that the commuting conversions for join 
points do a better job anyway. 
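
The two Simplifier behaviors described above (jfloat and abort) can be illustrated on a toy expression type. The sketch below is not GHC's Simplifier: Expr, Ctx, pushCtx, example, and plusOne are hypothetical names, jumps carry no result type, and the evaluation context is modeled simply as a Haskell function that plugs the hole.

```haskell
-- A toy direct-style language with join points (illustrative only).
data Expr
  = Lit Int
  | Var String
  | Add Expr Expr                  -- strict primitive addition
  | If Expr Expr Expr              -- both branches are tail positions
  | Join String String Expr Expr   -- join j x = rhs in body
  | Jump String Expr               -- jump j arg (result type omitted)
  deriving (Eq, Show)

-- An evaluation context, represented by the function that fills its hole.
type Ctx = Expr -> Expr

-- Push an evaluation context into an expression.
pushCtx :: Ctx -> Expr -> Expr
-- jfloat: copy the context into the right-hand side and the body.
pushCtx k (Join j x rhs body) = Join j x (pushCtx k rhs) (pushCtx k body)
-- abort: a jump never returns, so the context is thrown away.
pushCtx _ e@(Jump _ _)        = e
-- commuting conversion: push the context into both branches.
pushCtx k (If c t e)          = If c (pushCtx k t) (pushCtx k e)
-- anything else: plug the expression into the hole.
pushCtx k e                   = k e

-- join j x = x in if b then jump j 1 else 5
example :: Expr
example =
  Join "j" "x" (Var "x")
    (If (Var "b") (Jump "j" (Lit 1)) (Lit 5))

-- The context "hole + 1".
plusOne :: Ctx
plusOne e = Add e (Lit 1)
```

Here pushCtx plusOne example yields the term join j x = x + 1 in if b then jump j 1 else 5 + 1: the context reaches the join point's right-hand side and the non-jumping branch, and disappears at the jump.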


Benchmarks The reason for adding join points is to improve
performance; expressiveness is unchanged (Sec. [6]). So does
performance improve? Table [1] presents benchmark data on
allocations, collected from the standard spectral, real and shootout
spectral
Program          Allocs
fibheaps          -1.1%
ida               -1.4%
nucleic2          +0.2%
para              -4.3%
primetest         -3.6%
simple            -0.9%
solid             -8.4%
sphere            -3.3%
transform         +1.1%
(45 others)
Min               -8.4%
Max               +1.1%
Geo. Mean         -0.4%

real
Program          Allocs
anna              +0.5%
cacheprof         -0.5%
fem               +3.6%
gamteb            -1.4%
hpg               -2.1%
parser            +1.2%
rsa               -4.7%
(18 others)
Min               -4.7%
Max               +3.6%
Geo. Mean         -0.2%

shootout
Program          Allocs
k-nucleotide     -85.9%
n-body          -100.0%
spectral-norm     -0.8%
(5 others)
Min             -100.0%
Max               +0.0%
Geo. Mean           n/a

Table 1: Benchmarks from the spectral, real, and shootout
NoFib suites.


NoFib benchmark suites.⁹ We ran the tests on our modified GHC
branch, and compared them to the GHC baseline to which our
modifications were applied. Remember, the baseline compiler already
recognises join points in the back end and compiles them efficiently
(Sec. [2]); the performance changes here come from preserving and
exploiting join points during optimization. 

We report only heap allocations because they are a repeatable 
proxy for runtime; the latter is much harder to measure reliably. All 
tests omitted from the tables had an improvement in allocations, 
but less than 0.3%. 

There are some startling figures: using join points eliminated 
all allocations in n-body and 85.9% in k-nucleotide. We cau- 
tion that these are highly atypical programs, already hand-crafted 
to run fast. Still, it seems that our work may make it easier for
performance-hungry authors to squeeze more performance out of 
their inner loops. 

The complex interaction between inlining and other transforma- 
tions makes it impossible to give guaranteed improvements. For 
example, improving a function f might make it small enough to
inline into g , but this may cause g to become too large to inline 
elsewhere, and that in turn may lose the optimization opportuni- 
ties previously exposed by inlining g. GHC’s approach is heuris- 
tic, aiming to make losses unlikely, but they do occur, including a 
1.1% increase in allocations in spectral/transform and a 3.6% 
increase in real/fem.

Beyond benchmarks These benchmarks show modest but fairly 
consistent improvements for existing, unmodified programs. But 
we believe that the systematic addition of join points may have a 
more significant effect on programming patterns. Our discussion 
of fusion in Sec. [5] is a case in point: with join points we can use 
skip-less unfoldr/destroy streams without sacrificing fusion. That 
knowledge in turn affects the way in which libraries are written: 
they can be smaller and faster. 


9 The imaginary suite had no interesting cases. We believe this is be- 
cause join points tend to show up only in fairly large functions, and the 
imaginary tests are all micro-benchmarks. 


Moreover, the transformation pipeline becomes more robust. In 
GHC today, if a “join point” is inlined we get good fusion behavior, 
but if its size grows to exceed the (arbitrary) inlining threshold, 
suddenly behavior becomes much worse. An innocuous change in 
the source program can lead to a big change in execution time. That 
step-change problem disappears when we formally add join points. 

8. Why not use continuation-passing style? 

Our join points are, of course, nothing more than continuations, al- 
beit second-class continuations that do not escape, and thus can be 
implemented efficiently. So why not just use CPS? Kennedy’s work 
makes a convincing argument for CPS as a compiler intermediate 
language in which to perform optimization.

There are many similarities between Kennedy’s work and ours. 
Notably, Kennedy distinguishes ordinary bindings (let) from con- 
tinuation bindings (letcont ), just as we distinguish ordinary bind- 
ings from join points (join); similarly, he distinguishes continua- 
tion invocations (i.e. jumps) from ordinary function calls, and we 
follow suit. But there are a number of reasons to prefer direct style, 
if possible: 

• Direct style is, well, more direct. Programs are simply easier to 
understand, and the compiler's optimizations are easier to fol- 
low. Although it sounds superficial, in practice it is a significant 
advantage of direct style; for example, Haskell programmers
often pore over GHC’s Core dumps of their programs.

• The translation into CPS encodes a particular order of evalua- 
tion, whereas direct style does not. That dramatically inhibits 
code-motion transformations. For example, GHC does a great 
deal of “let floating” [20], in which a let binding is floated
outwards or inwards, which is valid for pure (effect-free) bindings.
This becomes harder or impossible in CPS, where the order of 
evaluation is prescribed. 

Fixing the order of evaluation is a particular issue when compil- 
ing a call-by-need language, since the known call-by-need CPS 
transform is quite involved.

• Some transformations are much harder in CPS. For exam- 
ple, consider common sub-expression elimination (CSE). In 
f (g x) (g x), the common sub-expression is easy to see. 
But it is much harder to find in the CPS version: 

letcont k1 xv = letcont k2 yv = f k xv yv
                in g k2 x
in g k1 x

• GHC makes extensive use of user-written rewrite rules as
optimizing transformations. For example, stream fusion relies
on the following rule, which states that turning a stream into a
list and back does nothing [6]:

{-# RULES "stream/unstream"
      forall s. stream (unstream s) = s #-}

In CPS, these nested function applications are more difficult 
to spot. Also, rule matching is simply easier to reason about 
when the rules are written in more-or-less the same syntax as 
the intermediate language; since the point is to write the rules 
in the source language, this calls for an intermediate language 
that doesn’t make the same radical changes that CPS makes. 
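
The common sub-expression example from the list above can be written out in Haskell to show why the duplicate call is harder to spot in CPS. This is a sketch under assumptions: f and g are arbitrary stand-ins, fK and gK are hypothetical CPS versions taking the continuation first as in the letcont code, and directCSE is what a CSE pass would produce from direct.

```haskell
-- Arbitrary stand-ins for the functions f and g in the text.
f :: Int -> Int -> Int
f = (+)

g :: Int -> Int
g = (* 2)

-- Direct style: the repeated sub-expression g x is syntactically visible.
direct :: Int -> Int
direct x = f (g x) (g x)

-- After common sub-expression elimination.
directCSE :: Int -> Int
directCSE x = let v = g x in f v v

-- CPS versions of f and g, continuation first (hypothetical names).
fK :: (Int -> r) -> Int -> Int -> r
fK k a b = k (f a b)

gK :: (Int -> r) -> Int -> r
gK k a = k (g a)

-- CPS version of direct: the two calls to g now sit in different
-- continuations (k1 and k2), so they no longer look like one repeated
-- expression.
cps :: (Int -> r) -> Int -> r
cps k x = gK k1 x
  where
    k1 xv = gK k2 x
      where
        k2 yv = fK k xv yv
```

All three functions compute the same result; for example, direct 3, directCSE 3, and cps id 3 are all 12.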

9. Related work 

Join points and commuting conversions Join points have been 
around for a long time in practice [28], but they have lacked a
formal treatment until now. By introducing join points at the level 
at which common optimizations are applied, we're able to exploit 
them more fully. For example, stream fusion as discussed in Sec. [5] 
depends on several algorithms working in concert, including
commuting conversions, inlining, user-specified rewrite rules, and
call-pattern specialization.

Fluet and Weeks describe MLton’s intermediate language,
whose syntax is much like ours (only first-order). However, it 
requires that nontail calls be written so as to pass the result to 
a named continuation (what we would call a join point). As the 
authors note, however, this is only a minor syntactic change from 
passing the continuation as a parameter, and so the language has 
more in common with CPS than with direct style. 

Commuting conversions are also discussed by Benton et al. in a
call-by-value setting [4]. Consider:

let z = let y = case a of { A -> e1; B -> e2 }
        in e3
in e4

They show how to apply commuting conversions from the inside 
outward, creating functions as join points to share code, getting: 

let z = let j2 y = e3
        in case a of { A -> j2 e1; B -> j2 e2 }
in e4

and then:

let { j1 z = e4; j2 y = e3 }
in case a of { A -> j1 (j2 e1); B -> j1 (j2 e2) }

They call j1 a “useless function”: it is only applied to the result
of j2. It would be better to combine j1 with j2 to save a function
call. Their solution is to be careful about the order of commuting
conversions, since the problem does not occur if one goes from the
outside inward instead. However, with join points, the order does
not matter! If we make j2 a join point, then the second step is
instead

join j2 y = let z = e3 in e4
in case a of { A -> j2 e1; B -> j2 e2 }

which is the same result one gets starting from the outside. So our 
approach is more robust to the order in which transformations are 
applied. 

SSA The majority of current commercial and open-source com- 
pilers (including, for example, GCC, LLVM, Mozilla JavaScript) 
and compiler frameworks use the Static Single Assignment (SSA) 
form [7], which imposes on an assembly-like language the
invariant that variables are assigned only once. If a variable might
have different values, it is defined by a φ-node, which chooses a value
depending on control flow. This makes data flow explicit, which in 
turn helps to simplify some optimizations. 

As it happens, SSA is inter-derivable with CPS or ANF.
Code blocks in SSA become mutually-recursive continuations in
CPS or functions in ANF, and φ-nodes indicate the parameters at
the different call sites. In fact, in ANF, the functions representing
blocks are always tail-called, so adding join points to ANF gives
a closer correspondence to SSA code: functions correspond to
functions and join points correspond to blocks. Indeed the Swift
Intermediate Language SIL appears to have adopted the idea of
“basic blocks with arguments” instead of φ-nodes.

Sequent calculus Previous work has shown how to define
an intermediate language, called Sequent Core, which sits in be- 
tween direct style and CPS. Sequent Core disentangles the concepts 
of “context” and “evaluation order” — contexts are invaluable, but 
Haskell has no fixed evaluation order, a fact which GHC exploits 
ruthlessly. Interestingly, the inspiration for our language’s design 
came from logic, namely the sequent calculus. The sequent calcu- 
lus is the twin brother of natural deduction, which is the foundation 
of all the other direct-style representations. In this paper, we use 
Sequent Core as our inspiration much as Flanagan et al. used
CPS, thus putting forward a new motto: Think in sequent calculus;
work in λ-calculus.


Relation to a language with control Since Fj has a notion of 
control, it becomes natural to relate it to known control
theories such as the one developed to reason about callcc in Scheme.
In fact, our language can encode callcc v as join j x = x in
v (λy. jump j y). By design, this encoding does not type
in our system since the continuation variable j is free in a
lambda-abstraction. This has repercussions on the semantics: join points
can no longer be saved in the stack but need to be stored in the 
heap, which is precisely what is needed to implement callcc. 

10. Reflections 

Based on our experience in a mature compiler for a statically-typed 
functional language, the use of Fj as an intermediate language 
seems very attractive. Compared to the baseline of System F, Fj 
is a rather small change; other transformations are barely affected; 
the new commuting conversions are valuable in practice; and they 
make the transformation pipeline more robust. 

Although we have presented Fj as a lazy language, everything 
in this paper applies equally to a call-by-value language. All one 
needs to do is to change the evaluation context, the notion of what 
is substitutable, and a few typing rules (as described in Sec. [6]). 